Welcome! In this Jupyter notebook, you will use scikit-learn to build and predict with the $k$-Nearest Neighbours (or $k$-NN) algorithm. We will also talk a bit about distance measures. But that's enough; let's jump into it.
In order for the notebooks to function as intended, modify only between lines marked "### begin your code here (__ lines)." and "### end your code here.".
The line count is a suggestion of how many lines of code you need to accomplish what is asked.
You should execute the cells (the boxes that a notebook is composed of) in order.
You can execute a cell by pressing Shift and Enter (or Return) simultaneously.
You should have completed the Decision Trees Jupyter notebook before attempting this one as the concepts covered there are not repeated, for the sake of brevity.
Again, we start by loading the required packages. From scikit-learn, this time we will load the classes appropriate for $k$-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
Let's turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)
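To see what this setting changes, here is a small before/after sketch with a toy value:

```python
import numpy as np

# Without suppression, small floats print in scientific notation.
np.set_printoptions(suppress=False)
print(np.array([0.000123]))   # scientific notation, e.g. [1.23e-04]

# With suppression, the same value prints in plain fixed-point notation.
np.set_printoptions(suppress=True)
print(np.array([0.000123]))
```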
We will load our data from a CSV file and put it in an object of the pandas DataFrame class.
This is the Iris flower dataset, a very famous dataset published in 1936, based on the study of three different species of the Iris flower: Iris setosa, Iris versicolor and Iris virginica. Originally, the dataset has 150 examples corresponding to 150 different Iris flowers measured (50 flowers from each species), and 4 features: the sepal length, sepal width, petal length and petal width of each flower in centimeters. However, for ease of illustration, we have chosen only two of the features: the sepal length and sepal width.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
df = pd.read_csv('data_knn.csv')
Let's take a look at our data as a table:
df
Because our data now has 2 features, we can use a single scatter plot to take a look at it:
fig = px.scatter(df, x="Sepal Length", y="Sepal Width", color="Species")
fig.show()
Let's extract our data and targets as NumPy arrays X and y from the pandas DataFrame df. This time, for visualization purposes, we need our targets y to be integers instead of text, so we have to extract the text labels as y_text and then transform them into integer labels y. Fortunately, we can do that with the LabelEncoder class that comes with scikit-learn:
X = df.drop('Species', axis=1).to_numpy()
y_text = df['Species'].to_numpy()
y = LabelEncoder().fit_transform(y_text)
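To see what LabelEncoder does, here is a sketch on a toy array of species labels (the strings are illustrative stand-ins for whatever labels the CSV actually contains):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A toy version of the species column: text labels in arbitrary order.
labels = np.array(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# The distinct classes are sorted alphabetically, then mapped to 0, 1, 2, ...
print(le.classes_)   # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(encoded)       # [2 0 1 0]

# inverse_transform recovers the original text labels from the integers.
print(le.inverse_transform(encoded))
```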
Let's see our data X:
X
Let's check the shape of X as well:
X.shape
Let's do the same checks with the targets vector y:
y
...and the shape of y:
y.shape
Everything looks good.
Again, let's split our data into training, validation and test sets. Let's use 60% (90 examples) for training, 20% for validation (30 examples) and the remaining 20% (30 examples) as test data.
(X_train, X_vt, y_train, y_vt) = train_test_split(X, y, test_size=0.4, random_state=0)
(X_validation, X_test, y_validation, y_test) = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)
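The two-stage split above can be checked with a synthetic stand-in of the same size (the real X and y come from the CSV): splitting off 40% and then halving that 40% yields the intended 90/30/30 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the 150-example dataset.
X_demo = np.arange(150 * 2).reshape(150, 2)
y_demo = np.arange(150) % 3

# First split off 40% of the data, then split that 40% in half.
X_tr, X_vt, y_tr, y_vt = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)

print(len(X_tr), len(X_val), len(X_te))   # 90 30 30
```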
Now, let's build our $k$-NN model! We will create an object of the class KNeighborsClassifier and assign the name knn to the resulting object. Remember that we have to specify a number for $k$, which is called n_neighbors in the scikit-learn implementation.
However, before doing that, let's have a quick discussion about distance measures, which greatly affect how our $k$-NN classifier performs. Remember that you have to use your knowledge of the data to come up with a distance measure that makes sense for the data you have. You may also have to scale different features by different weights to get a good distance measurement out of their combination, before attempting to train and use a model. For example, if we have integer data, it may make sense to use the Manhattan distance measure, or the Hamming distance for bit data.
In our data here, all our measurements are lengths in centimeters, so no adjustment is needed and a Euclidean distance measure actually makes a lot of sense, since it measures some kind of length and the unweighted combination is also sound. For KNeighborsClassifier, the default distance measure or metric (as the argument is called in scikit-learn) is the $L_p$ measure or the Minkowski distance (metric='minkowski' in the scikit-learn call) with $p=2$ (p=2 in the KNeighborsClassifier arguments), which is nothing other than the $L_2$ distance measure, i.e. the Euclidean distance.
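To make the relationship concrete, here is a small sketch (with two made-up flower measurements) showing that the Minkowski distance with $p=2$ is the ordinary Euclidean distance, and with $p=1$ the Manhattan distance:

```python
import numpy as np

a = np.array([5.1, 3.5])
b = np.array([6.2, 2.9])

# Minkowski distance of order p: (sum_i |a_i - b_i|^p)^(1/p)
def minkowski(u, v, p):
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# With p=2 this is the ordinary Euclidean (L2) distance...
print(minkowski(a, b, 2))
print(np.sqrt(np.sum((a - b) ** 2)))   # same value

# ...and with p=1 it is the Manhattan (L1) distance: |5.1-6.2| + |3.5-2.9| = 1.7
print(minkowski(a, b, 1))
```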
So, other than setting n_neighbors to a suitable value, we don't have to specify anything else. Let's start with a value of 1 for n_neighbors.
You can see the documentation for KNeighborsClassifier here:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Go ahead and implement that now:
knn = KNeighborsClassifier(p=2, n_neighbors=1, metric="minkowski")
Next, let's fit our knn to X_train and y_train. This does nothing but store the training examples, as $k$-NN is "lazy": it does all calculations at prediction time, by measuring the distance from the operational datapoints provided (whose labels have to be predicted) to each of the training datapoints, finding the closest training datapoints to the operational points, looking at the labels of those closest training datapoints, and taking the majority class among them.
knn.fit(X_train, y_train)
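The "lazy" prediction procedure described above can be sketched by hand for the $k=1$ case. This toy version (with made-up points, not the notebook's X_train) does exactly what predict does: compute all distances, find the closest stored point, and copy its label.

```python
import numpy as np

# Tiny training set: four stored points with known labels.
X_tr = np.array([[5.0, 3.4], [6.4, 3.2], [4.9, 3.1], [6.9, 3.1]])
y_tr = np.array([0, 1, 0, 1])

def predict_1nn(x_new):
    # Euclidean distance from the query point to every stored training point...
    dists = np.sqrt(((X_tr - x_new) ** 2).sum(axis=1))
    # ...then copy the label of the single closest one (k = 1).
    return y_tr[np.argmin(dists)]

print(predict_1nn(np.array([5.1, 3.3])))   # closest point is [5.0, 3.4] -> label 0
```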
detail_steps = 500
(x_vis_0_min, x_vis_0_max) = (X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
(x_vis_1_min, x_vis_1_max) = (X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
x_vis_0_range = np.linspace(x_vis_0_min, x_vis_0_max, detail_steps)
x_vis_1_range = np.linspace(x_vis_1_min, x_vis_1_max, detail_steps)
(XX_vis_0, XX_vis_1) = np.meshgrid(x_vis_0_range, x_vis_1_range)
X_vis = np.c_[XX_vis_0.reshape(-1), XX_vis_1.reshape(-1)]
yhat_vis = knn.predict(X_vis)
YYhat_vis = yhat_vis.reshape(XX_vis_0.shape)
region_colorscale = [
[0.0, 'rgb(199, 204, 249)'],
[0.5, 'rgb(235, 185, 177)'],
[1.0, 'rgb(159, 204, 186)']
]
points_colorscale = [
[0.0, 'rgb(99, 110, 250)'],
[0.5, 'rgb(239, 85, 59)'],
[1.0, 'rgb(66, 204, 150)']
]
fig2 = go.Figure(
data=[
go.Heatmap(x=x_vis_0_range,
y=x_vis_1_range,
z=YYhat_vis,
colorscale=region_colorscale,
showscale=False),
go.Scatter(x=df['Sepal Length'],
y=df['Sepal Width'],
mode='markers',
text=df['Species'],
name='',
# showscale=False,
marker=dict(
color=y,
colorscale=points_colorscale
)
)
],
layout=go.Layout(
xaxis=dict(title='Sepal Length'),
yaxis=dict(title='Sepal Width')
)
)
fig2.show()
Again, you will get a summary for the model:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
Now, let's visualize our $k$-NN model. We are using plotly heatmaps (which are a bit more tedious to produce at the moment, as they have not been ported to plotly express yet) to show the regions where points will be predicted to be in different classes (this will take some time), and we will overlay a scatter plot of our datapoints on top of that:
Evaluation time! Let's first see how our $k$-NN does on the training data. We expect an accuracy near 100% (or 1.0): with $k=1$, the closest training point to each training point is that point itself, so the predicted class matches the training target. It may still not be exactly 100%, because with only two features the data can contain identical points belonging to different species, and such duplicates cannot all be predicted correctly. So go ahead and calculate yhat_train using knn and X_train:
yhat_train = knn.predict(X_train)
Now, to measure the accuracy from yhat_train and y_train:
accuracy_score(yhat_train, y_train)
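As a side note, accuracy is simply the fraction of predictions that match the targets; a sketch with toy labels showing that accuracy_score agrees with a plain NumPy mean:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# accuracy_score is just the fraction of matching labels (here 5 of 6 correct):
print(accuracy_score(y_true, y_pred))
print(np.mean(y_pred == y_true))   # same value
```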
It is ??.??%, as we expected. We also expect the accuracy on the validation points not to be as high, since we are probably overfitting by using $k=1$. Put in the line that generates yhat_validation, so we can measure the accuracy:
yhat_validation = knn.predict(X_validation)
accuracy_score(yhat_validation, y_validation)
Okay, we got ??.??%, which leaves a big gap from the accuracy on the training data.
Let's use a higher value of $k$, i.e. n_neighbors. Let's use 3 and see what happens. Go ahead and make a new model knn3, and provide X_train and y_train to it afterwards using the method fit:
knn3 = KNeighborsClassifier(p=2, n_neighbors=3, metric="minkowski")
knn3.fit(X_train, y_train)
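With $k=3$, prediction becomes a majority vote among the labels of the three nearest training points. A minimal sketch of that vote with toy labels:

```python
import numpy as np

# Labels of the three nearest training points to some query point.
neighbor_labels = np.array([1, 2, 1])

# Majority vote: count occurrences of each class, pick the most frequent.
votes = np.bincount(neighbor_labels)
print(votes)            # counts per class: [0 2 1]
print(np.argmax(votes)) # predicted class: 1
```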
Our model summary should indicate n_neighbors=3 now:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=3, p=2, weights='uniform')
We can visualize the model with $k=3$:
yhat3_vis = knn3.predict(X_vis)
YYhat3_vis = yhat3_vis.reshape(XX_vis_0.shape)
fig3 = fig2
fig3['data'][0]['z'] = YYhat3_vis
fig3.show()
You can see the regions are smoother and less "patchy".
Now, predict yhat_train3 with this new knn3, so we can measure the accuracy on the training data:
yhat_train3 = knn3.predict(X_train)
accuracy_score(yhat_train3, y_train)
We got ??.??%. This time we did not predict with near 100% accuracy, since the training points surrounding a training point may have different labels. On to validation accuracy, then. Let's predict yhat_validation3, so we can measure the accuracy on the validation data:
### begin your code here (1 line).
yhat_validation3 = knn3.predict(X_validation)
### end your code here.
accuracy_score(yhat_validation3, y_validation)
We get a better accuracy on the validation points (??.??%), again as expected, since we are probably overfitting less. Let's try $k=5$ as well. Make a new model knn5 with n_neighbors=5 and provide the data to it via fit:
knn5 = KNeighborsClassifier(p=2, n_neighbors=5, metric="minkowski")
knn5.fit(X_train, y_train)
You should get:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')
We can visualize the model with $k=5$ as well:
yhat5_vis = knn5.predict(X_vis)
YYhat5_vis = yhat5_vis.reshape(XX_vis_0.shape)
fig4 = fig2
fig4['data'][0]['z'] = YYhat5_vis
fig4.show()
You can see the regions are even smoother.
Next, let's evaluate our model, first on the training data. Fill in the part that calculates yhat_train5, and we can measure the accuracy on the training data:
yhat_train5 = knn5.predict(X_train)
accuracy_score(yhat_train5, y_train)
??.??%. Do the same on the validation data by predicting yhat_validation5:
yhat_validation5 = knn5.predict(X_validation)
accuracy_score(yhat_validation5, y_validation)
So, we get ??.??% on validation set.
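The three models we just built and compared by hand can also be compared in one loop. This is a sketch of selecting $k$ by validation accuracy; since the notebook's data comes from data_knn.csv, it uses scikit-learn's built-in copy of the Iris data (first two features, split 60/20/20 as above) as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the notebook's data: the classic Iris set, keeping only
# the first two features (sepal length, sepal width).
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_vt, y_train, y_vt = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)

# Fit one model per candidate k and keep the one with the best validation accuracy.
best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"k={k}: validation accuracy {acc:.4f}")
    if acc > best_acc:
        best_k, best_acc = k, acc

print("best k:", best_k)
```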
We got our best result with $k=3$ (??.??% > ??.??% > ??.??%), so let's get a final evaluation on our held-out test set. Predict yhat_test3:
yhat_test3 = knn3.predict(X_test)
accuracy_score(yhat_test3, y_test)
The accuracy on the test data (??.??%) is high, and is close to the performance on the validation and training data.
We have a good $k$-NN model now.